Your world-renowned work in cancer research, sports recruiting, advertising, and environmental research has led the government to reach out and ask for help in better understanding the US populace. Your goal is to build a classifier that can predict income levels in order to create more effective policy.
The dataset below includes Census data on 32,000+ individuals with a variety of variables and a target variable for above or below 50k in salary.
Your goal is to build a Random Forest Classifier to be able to predict income levels above or below 50k.
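The steps below assume a `census` data frame plus `census_train` / `census_test` splits that are not shown in this writeup. A minimal preparation sketch, with the file name and seed as assumptions (the ~28,781-row training and ~3,197-row test outputs below suggest roughly a 90/10 split):

```r
library(randomForest)
library(dplyr)

# Read the Census data; the file name is an assumption.
census <- read.csv("census_data.csv", stringsAsFactors = TRUE)
census$income <- as.factor(census$income)  # target coded "0" (<=50K) / "1" (>50K)

# ~90/10 train/test split; the original seed and ratio are not shown.
set.seed(1)
train_idx <- sample(nrow(census), size = round(0.9 * nrow(census)))
census_train <- census[train_idx, ]
census_test  <- census[-train_idx, ]
```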
Calculate the base rate
n_high_income <- nrow(filter(census, income == "1"))  # count of the positive class (>50k); avoids shadowing base::length
baserate <- n_high_income / nrow(census)              # positive-class share
baserate
## [1] 0.2406342
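Equivalently, since `income` is coded 0/1, the base rate is just the positive-class share. Using the class counts cited in the conclusion of this writeup (24,283 below 50k and 7,695 above), the arithmetic checks out:

```r
# Base rate = positive-class share; counts taken from the writeup's conclusion
pos <- 7695                      # individuals earning >50k
neg <- 24283                     # individuals earning <=50k
pos / (pos + neg)                # ~0.2406, matching the value above
```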
Calculate the initial mtry level
# length() on a data frame returns its number of columns (14 predictors + the
# target here), so this approximates the usual sqrt(p) default for classification
mtry_initial <- round(sqrt(length(census)))
mtry_initial
## [1] 4
Run the initial RF model with 500 trees
census_RF
##
## Call:
## randomForest(formula = income ~ ., data = census_train, ntree = 500, mtry = mtry_initial, replace = TRUE, sampsize = 100, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 15.67%
## Confusion matrix:
## 0 1 class.error
## 0 20561 1261 0.05778572
## 1 3248 3711 0.46673373
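The `Call` printed above implies the model was fit with something like the following; every argument is read directly off that output, and only the seed is an assumption, since it is not shown:

```r
library(randomForest)

set.seed(1)  # seed is an assumption; the original seed is not shown
census_RF <- randomForest(income ~ .,            # predict income from all other variables
                          data = census_train,
                          ntree = 500,           # number of trees
                          mtry = mtry_initial,   # variables tried at each split (4)
                          replace = TRUE,        # bootstrap sampling with replacement
                          sampsize = 100,        # rows sampled per tree
                          nodesize = 5,          # minimum terminal node size
                          importance = TRUE,     # compute variable importance
                          proximity = FALSE,
                          norm.votes = TRUE,     # report votes as fractions
                          do.trace = TRUE,       # print progress while fitting
                          keep.forest = TRUE,
                          keep.inbag = TRUE)
```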
Review the percentage of trees that voted for each data point to be in each class.
census_votes <- as.data.frame(census_RF$votes)
kable(head(census_votes))
| 0 | 1 |
|---|---|
| 0.8740000 | 0.1260000 |
| 0.4088176 | 0.5911824 |
| 0.9639279 | 0.0360721 |
| 0.6860000 | 0.3140000 |
| 0.5322581 | 0.4677419 |
| 0.3400402 | 0.6599598 |
Review the “predicted” element, which contains a vector of predictions for each data point.
head(as.data.frame(census_RF$votes))
## 0 1
## 1 0.8740000 0.12600000
## 2 0.4088176 0.59118236
## 3 0.9639279 0.03607214
## 4 0.6860000 0.31400000
## 5 0.5322581 0.46774194
## 6 0.3400402 0.65995976
head(as.data.frame(census_RF$predicted))
## census_RF$predicted
## 1 0
## 2 1
## 3 0
## 4 0
## 5 0
## 6 1
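The `predicted` vector is simply the majority vote over the OOB vote fractions shown above; for example, row 2 has 59.1% of its trees voting “1”, so its prediction is 1. A quick sketch of one way to confirm the relationship, using only the model object already in scope:

```r
# For each row, pick the class whose vote share is largest; this should
# match census_RF$predicted except where trees tie exactly at 0.5.
vote_winner <- colnames(census_RF$votes)[max.col(census_RF$votes, ties.method = "first")]
mean(vote_winner == as.character(census_RF$predicted))  # ~1 (exact ties may differ)
```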
Determine the most important variable for your model
census_importance_DF <- as.data.frame(census_RF$importance)
kable(census_importance_DF[-c(1:2)])
| | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|
| age | 0.0086944 | 3.1452175 |
| workclass | 0.0023666 | 1.6949648 |
| fnlwgt | 0.0003535 | 2.4118733 |
| education | 0.0116288 | 3.3959073 |
| education_num | 0.0117018 | 2.3795816 |
| marital_status | 0.0308497 | 3.2450132 |
| occupation | 0.0137208 | 4.8070253 |
| relationship | 0.0274160 | 3.2195532 |
| race | 0.0001544 | 0.3665042 |
| sex | 0.0018075 | 0.3422816 |
| capital_gain | 0.0146781 | 2.4104034 |
| capital_loss | 0.0014807 | 0.5492761 |
| hours_per_week | 0.0043147 | 1.7988305 |
| native_country | -0.0000891 | 0.4230629 |
Occupation is the most important variable with the highest mean decrease in Gini index, 4.8070253.
Visualize the model using plotly with lines tracking the overall oob rate and for the two classes of the target variable. What patterns do you notice?
census_RF_error = data.frame(1:nrow(census_RF$err.rate),
                             census_RF$err.rate)
colnames(census_RF_error) = c("Number of Trees", "Out of Bag",
                              "<=50K", ">50K")
census_RF_error$Diff <- census_RF_error$`>50K` - census_RF_error$`<=50K`
library(plotly)
fig <- plot_ly(x = census_RF_error$`Number of Trees`, y = census_RF_error$Diff, name = "Diff", type = 'scatter', mode = 'lines')
fig <- fig %>% add_trace(y = census_RF_error$`Out of Bag`, name = "OOB_Er")
fig <- fig %>% add_trace(y = census_RF_error$`<=50K`, name = "<=50K")
fig <- fig %>% add_trace(y = census_RF_error$`>50K`, name = ">50K")
fig
All of the error curves have high variance with a small number of trees, but they level out as the number of trees grows. In addition, the error for the >50K class is much higher than for the <=50K class.
Pull the confusion matrix out of the model and discuss the results
census_RF$confusion
## 0 1 class.error
## 0 20561 1261 0.05778572
## 1 3248 3711 0.46673373
# drop the class.error column so only counts enter the accuracy calculation
census_RF_conf = census_RF$confusion[, 1:2]
census_RF_acc = sum(census_RF_conf[row(census_RF_conf) ==
                                     col(census_RF_conf)]) /
  sum(census_RF_conf)
census_RF_acc
## [1] 0.8433341
The accuracy of the model is 84.33%.
Determine the number of trees at which the model's out-of-bag error is lowest (row i of the error matrix is the OOB error of the forest truncated to i trees)
min_OOB <- min(census_RF$err.rate[, "OOB"])
number_trees <- which(census_RF$err.rate == min_OOB, arr.ind = TRUE)[, 1]
number_trees
## row
## 93
Using the error.rate matrix select the number of trees that appears to minimize your error.
min_OOB_matrix <- min(census_RF$err.rate[,1])
number_trees_matrix <- which(census_RF$err.rate == min_OOB_matrix, arr.ind=TRUE)[,1]
number_trees_matrix
## row
## 93
Using the visualizations, the OOB error, and the errors for the positive and negative classes, select a new number of trees and rerun the model
census_RF_2
##
## Call:
## randomForest(formula = income ~ ., data = census_train, ntree = 93, mtry = mtry_initial, replace = TRUE, sampsize = 300, nodesize = 5, importance = TRUE, proximity = FALSE, norm.votes = TRUE, do.trace = TRUE, keep.forest = TRUE, keep.inbag = TRUE)
## Type of random forest: classification
## Number of trees: 93
## No. of variables tried at each split: 4
##
## OOB estimate of error rate: 14.96%
## Confusion matrix:
## 0 1 class.error
## 0 20432 1390 0.06369719
## 1 2915 4044 0.41888202
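Again reading off the printed `Call`, the second model was fit with something like the following (the seed is an assumption):

```r
set.seed(1)  # seed is an assumption; the original seed is not shown
census_RF_2 <- randomForest(income ~ .,
                            data = census_train,
                            ntree = 93,          # trees chosen from the minimum-OOB analysis
                            mtry = mtry_initial,
                            replace = TRUE,
                            sampsize = 300,      # note: larger per-tree sample than the first model (100)
                            nodesize = 5,
                            importance = TRUE,
                            proximity = FALSE,
                            norm.votes = TRUE,
                            do.trace = TRUE,
                            keep.forest = TRUE,
                            keep.inbag = TRUE)
```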
Compare the new model to the original and discuss the differences
census_RF$confusion
## 0 1 class.error
## 0 20561 1261 0.05778572
## 1 3248 3711 0.46673373
census_RF_2$confusion
## 0 1 class.error
## 0 20432 1390 0.06369719
## 1 2915 4044 0.41888202
# again drop the class.error column so only counts enter the calculation
census_RF_2_conf = census_RF_2$confusion[, 1:2]
census_RF_2_acc = sum(census_RF_2_conf[row(census_RF_2_conf) ==
                                         col(census_RF_2_conf)]) /
  sum(census_RF_2_conf)
census_RF_2_acc
## [1] 0.8504222
The accuracy of the second model is 85.04%, which is 0.71 percentage points higher than the first model's 84.33%. Note, though, that the two calls differ in more than ntree: the second model also raises sampsize from 100 to 300, so the improvement cannot be attributed to the tree count alone.
Use the better-performing model to predict on your test set
census_predict = predict(census_RF_2, #<- a randomForest model
census_test, #<- the test data set to use
type = "response", #<- what results to produce, see the help menu for the options
predict.all = TRUE, #<- should the predictions of all trees be kept?
proximity = TRUE)
Use the confusion matrix function to assess the model quality
census_test_pred = data.frame(census_test,
Prediction = census_predict$predicted$aggregate)
library(caret)  # provides confusionMatrix()
confusionMatrix(census_test_pred$Prediction, census_test_pred$income, positive = "1",
                dnn = c("Prediction", "Actual"), mode = "everything")
## Confusion Matrix and Statistics
##
## Actual
## Prediction 0 1
## 0 2330 310
## 1 131 426
##
## Accuracy : 0.8621
## 95% CI : (0.8496, 0.8738)
## No Information Rate : 0.7698
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.5745
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Sensitivity : 0.5788
## Specificity : 0.9468
## Pos Pred Value : 0.7648
## Neg Pred Value : 0.8826
## Precision : 0.7648
## Recall : 0.5788
## F1 : 0.6589
## Prevalence : 0.2302
## Detection Rate : 0.1332
## Detection Prevalence : 0.1742
## Balanced Accuracy : 0.7628
##
## 'Positive' Class : 1
##
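As a sanity check, the precision, recall, and F1 reported above can be recomputed by hand from the test-set confusion matrix (426 true positives, 131 false positives, 310 false negatives):

```r
# Counts taken from the confusion matrix above (positive class = "1")
tp <- 426; fp <- 131; fn <- 310
precision <- tp / (tp + fp)                           # 426/557
recall    <- tp / (tp + fn)                           # 426/736
f1 <- 2 * precision * recall / (precision + recall)
round(c(Precision = precision, Recall = recall, F1 = f1), 4)
```

These reproduce the Precision (0.7648), Recall (0.5788), and F1 (0.6589) values printed by confusionMatrix().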
Create a variable importance plot and discuss the results
census2_importance_DF <- as.data.frame(census_RF_2$importance)
kable(census2_importance_DF[-c(1:2)])
| | MeanDecreaseAccuracy | MeanDecreaseGini |
|---|---|---|
| age | 0.0118476 | 9.2277676 |
| workclass | 0.0023134 | 4.6685858 |
| fnlwgt | 0.0002341 | 8.0975631 |
| education | 0.0154417 | 9.7668028 |
| education_num | 0.0142407 | 6.1403563 |
| marital_status | 0.0372957 | 9.7215352 |
| occupation | 0.0157780 | 13.5910933 |
| relationship | 0.0315023 | 10.4430520 |
| race | 0.0002394 | 0.8735149 |
| sex | 0.0033935 | 1.1472097 |
| capital_gain | 0.0193705 | 7.3990921 |
| capital_loss | 0.0026479 | 2.2409120 |
| hours_per_week | 0.0047495 | 5.9747754 |
| native_country | -0.0001427 | 1.9255412 |
varImpPlot(census_RF_2,
sort = TRUE,
n.var = 10,
main = "Important Factors for Identifying Income Range",
#cex = 2, #<- size of characters or symbols
bg = "white",
color = "blue",
lcolor = "orange")
Occupation has the highest mean decrease in Gini index and the second highest mean decrease in accuracy. This highlights how important the occupation variable is. Relationship is also a very important variable, with the second-highest mean decrease in Gini index and the third-highest mean decrease in accuracy.
Use the tuneRF function to select an optimal number of variables to try at each split; if necessary, rebuild your model and compare the results to the previous model.
set.seed(2)
census_RF_mtry = tuneRF(data.frame(census_train[ ,1:14]),
as.factor(census_train$income),
mtryStart = 5,
ntreeTry = 100,
stepFactor = 2,
improve = 0.05,
trace = TRUE,
plot = TRUE,
doBest = FALSE)
## mtry = 5 OOB error = 14.54%
## Searching left ...
## mtry = 3 OOB error = 13.84%
## 0.04804015 0.05
## Searching right ...
## mtry = 10 OOB error = 14.93%
## -0.02676864 0.05
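Here tuneRF found its lowest OOB error (13.84%) at mtry = 3, but the relative improvement over mtry = 5 (about 0.048) fell just below the improve = 0.05 threshold, so the search stopped without selecting it. If you did want to rebuild with that value, a sketch (census_RF_3 is a hypothetical name; the other settings are copied from the second model):

```r
set.seed(2)
census_RF_3 <- randomForest(income ~ .,
                            data = census_train,
                            ntree = 93,
                            mtry = 3,        # value suggested by tuneRF
                            sampsize = 300,
                            nodesize = 5,
                            importance = TRUE)
census_RF_3$confusion    # compare class errors against census_RF_2
```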
Create a graphic that shows the size of the trees being grown for your final model.
hist(treesize(census_RF_2, terminal = TRUE), main="Tree Size")
Evaluate the final model using the performance/prediction functions in the ROCR package.
library(ROCR)  # provides prediction() and performance()
census_RF_prediction = as.data.frame(as.numeric(as.character(census_RF_2$votes[,2])))
census_train_actual = data.frame(census_train[,15])
census_prediction_comparison = prediction(census_RF_prediction, census_train_actual)
census_pred_performance = performance(census_prediction_comparison,
measure = "tpr", #<- performance measure to use for the evaluation
x.measure = "fpr")
census_rates = data.frame(fp = census_prediction_comparison@fp, #<- false positive classification.
tp = census_prediction_comparison@tp, #<- true positive classification.
tn = census_prediction_comparison@tn, #<- true negative classification.
fn = census_prediction_comparison@fn) #<- false negative classification.
colnames(census_rates) = c("fp", "tp", "tn", "fn")
tpr = census_rates$tp / (census_rates$tp + census_rates$fn)
fpr = census_rates$fp / (census_rates$fp + census_rates$tn)
census_rates_comparison = data.frame(census_pred_performance@x.values,
census_pred_performance@y.values,
fpr,
tpr)
colnames(census_rates_comparison) = c("x.values","y.values","fpr","tpr") #<- rename columns accordingly.
census_auc_RF = performance(census_prediction_comparison,
"auc")@y.values[[1]]
# base graphics are layered with separate calls, not ggplot-style "+"
# (chaining them with "+" is what produced a stray "integer(0)" in the output)
plot(census_pred_performance,
     col = "red",
     lwd = 3,
     main = "ROC curve")
grid(col = "black")
abline(a = 0,
       b = 1,
       lwd = 2,
       lty = 2,
       col = "gray")
text(x = 0.5,
     y = 0.5,
     labels = paste0("AUC = ",
                     round(census_auc_RF,
                           2)))
Our final model was the one built with the number of trees that minimized our OOB error. It had an accuracy of .8621, much higher than the no-information rate of .7698. Despite this high accuracy, the kappa value was a moderate .5745. For more insight, we look at sensitivity and specificity: the model struggled to predict the positive class, with a sensitivity of only .5788, while its specificity was .9468. These values show how the model's accuracy was likely inflated. The dataset was imbalanced, with 24,283 individuals making less than $50k and only 7,695 making more than $50k, so the model's success on the large negative class outweighed its struggles on the positive class and inflated the accuracy. The model had an AUC of .9, but this high number is likely driven by the same imbalance. Looking at the Gini index and accuracy decreases for each variable, occupation was clearly the most important, with education, age, and marital status also proving to be important.